Boston Crime (2018) - EDA¶

boston.jpg

Boston, known as the "Hub of the Universe," is a city with a rich history and vibrant culture. Established in 1630, it played a significant role in the American Revolution and boasts numerous historical landmarks. Today, Boston is a thriving metropolis, home to prestigious universities like Harvard and MIT, fostering a climate of innovation and intellectual curiosity. The city's diverse neighborhoods each have their own unique charm, from the cobblestone streets of Beacon Hill to the bustling multicultural hub of Chinatown. With world-class museums, beautiful parks, passionate sports fans, and a lively arts scene, Boston offers a blend of tradition and modernity that makes it a captivating destination for residents and visitors alike.

Understanding the Hierarchy of the City - Boston¶

In Boston, the administrative hierarchy can be represented as follows:

  1. Country: United States

    - The highest level of administrative division, encompassing the entire country.
    
    
  2. State: Massachusetts

    - The state in which Boston is located.
    
    
  3. County: Suffolk County

    - The county in which Boston is located. Suffolk County includes the city of Boston and some neighboring areas.
    
    
  4. City: Boston

    - The city of Boston itself, which is the capital and largest city of Massachusetts.
    
    
  5. Neighborhoods/Districts: Boston is further divided into several neighborhoods or districts, each with its own characteristics and local governance. The Boston Police district codes used in this dataset, with their corresponding areas, are:

    A1: Downtown
    A15: Charlestown
    A7: East Boston
    B2: Roxbury
    B3: Mattapan
    C6: South Boston
    C11: Dorchester
    D4: South End
    D14: Brighton
    E5: West Roxbury
    E13: Jamaica Plain
    E18: Hyde Park
    
    

These administrative divisions outline the hierarchical structure of Boston's governance and provide a framework for managing and providing services to different areas within the city.

Libraries¶

In [1]:
from encodings.aliases import aliases     # Python has a file containing a dictionary of encoding names and associated aliases
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import plotly.express as px
import scipy.stats as stat
import seaborn as sns
import pandas as pd
import numpy as np
import calendar
import json
In [2]:
%matplotlib inline

Solving the encoding issue and Reading the File¶

In [3]:
# To find encodings that can read the file, the loop below tries every available encoding alias and prints the ones that succeed;
# Alternatively, open the csv in Notepad and read the encoding shown at the bottom-right corner;

alias_values = set(aliases.values())

for encoding in alias_values:
    try:
        df = pd.read_csv("Miscellaneous/crime.csv", nrows = 5, encoding = encoding)
        print('successful', encoding)
    except Exception:                     # narrower than a bare except: won't swallow KeyboardInterrupt
        pass
successful cp273
successful cp1258
successful iso8859_15
successful cp037
successful cp858
successful gbk
successful iso8859_9
successful iso8859_7
successful cp775
successful cp863
successful cp1252
successful cp857
successful mac_greek
successful cp855
successful cp437
successful cp949
successful mac_roman
successful kz1048
successful mbcs
successful ptcp154
successful cp852
successful iso8859_10
successful cp1254
successful hp_roman8
successful mac_turkish
successful cp869
successful cp1251
successful cp1026
successful cp1255
successful mac_latin2
successful cp850
successful cp1256
successful cp865
successful cp861
successful cp1140
successful cp1250
successful utf_8
successful iso8859_2
successful iso8859_16
successful iso8859_11
successful gb18030
successful cp1257
successful cp1125
successful iso8859_6
successful mac_iceland
successful cp862
successful cp932
successful koi8_r
successful cp500
successful iso8859_14
successful iso8859_13
successful cp866
successful mac_cyrillic
successful iso8859_4
successful latin_1
successful cp1253
successful cp864
successful cp860
successful iso8859_5
successful iso8859_3
In [4]:
# I have used ANSI as the encoding here, as I found the exact encoding by opening the file in Notepad
# (on Windows, the 'ANSI' alias maps to the active code page);
# If you don't know the exact encoding, use the method above and pick any of the encodings it prints

crime_df = pd.read_csv("Miscellaneous/crime.csv", encoding = "ANSI", low_memory = False)

crime_df.head()
Out[4]:
INCIDENT_NUMBER OFFENSE_CODE OFFENSE_CODE_GROUP OFFENSE_DESCRIPTION DISTRICT REPORTING_AREA SHOOTING OCCURRED_ON_DATE YEAR MONTH DAY_OF_WEEK HOUR UCR_PART STREET Lat Long Location
0 I192074715 2629 Harassment HARASSMENT B2 278 NaN 2018-01-01 00:00:00 2018 1 Monday 0 Part Two HARRISON AVE 42.331538 -71.080157 (42.33153805, -71.08015661)
1 I192068538 1107 Fraud FRAUD - IMPERSONATION D14 794 NaN 2018-01-01 00:00:00 2018 1 Monday 0 Part Two GLENVILLE AVE 42.349780 -71.134230 (42.34977988, -71.13423049)
2 I192005657 2610 Other TRESPASSING C11 396 NaN 2018-01-01 00:00:00 2018 1 Monday 0 Part Two MELBOURNE ST 42.291093 -71.065945 (42.29109287, -71.06594539)
3 I192075335 3208 Property Lost PROPERTY - MISSING D4 132 NaN 2018-01-01 00:00:00 2018 1 Monday 0 Part Three COMMONWEALTH AVE 42.353522 -71.072838 (42.35352153, -71.07283786)
4 I192013179 619 Larceny LARCENY ALL OTHERS C11 360 NaN 2018-01-01 00:00:00 2018 1 Monday 0 Part One CENTERVALE PARK 42.296323 -71.063569 (42.29632282, -71.06356881)
In [5]:
# Creating a copy to roll back to the same point in case it's needed

crime = crime_df.copy()
crime.head(1)
Out[5]:
INCIDENT_NUMBER OFFENSE_CODE OFFENSE_CODE_GROUP OFFENSE_DESCRIPTION DISTRICT REPORTING_AREA SHOOTING OCCURRED_ON_DATE YEAR MONTH DAY_OF_WEEK HOUR UCR_PART STREET Lat Long Location
0 I192074715 2629 Harassment HARASSMENT B2 278 NaN 2018-01-01 00:00:00 2018 1 Monday 0 Part Two HARRISON AVE 42.331538 -71.080157 (42.33153805, -71.08015661)

Knowing the primaries of data¶

In [6]:
crime.shape
Out[6]:
(98888, 17)
In [7]:
crime.size             # size = rows x columns (98888 x 17 = 1681096)
Out[7]:
1681096
In [8]:
crime.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98888 entries, 0 to 98887
Data columns (total 17 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   INCIDENT_NUMBER      98888 non-null  object 
 1   OFFENSE_CODE         98888 non-null  int64  
 2   OFFENSE_CODE_GROUP   98888 non-null  object 
 3   OFFENSE_DESCRIPTION  98888 non-null  object 
 4   DISTRICT             98206 non-null  object 
 5   REPORTING_AREA       98888 non-null  object 
 6   SHOOTING             402 non-null    object 
 7   OCCURRED_ON_DATE     98888 non-null  object 
 8   YEAR                 98888 non-null  int64  
 9   MONTH                98888 non-null  int64  
 10  DAY_OF_WEEK          98888 non-null  object 
 11  HOUR                 98888 non-null  int64  
 12  UCR_PART             98868 non-null  object 
 13  STREET               97274 non-null  object 
 14  Lat                  92133 non-null  float64
 15  Long                 92133 non-null  float64
 16  Location             92133 non-null  object 
dtypes: float64(2), int64(4), object(11)
memory usage: 12.8+ MB
In [9]:
total_memory_usage = crime.memory_usage().sum() / 1024**2

print(f"Total memory usage: {total_memory_usage:.2f} MB")
Total memory usage: 12.83 MB
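One caveat on the 12.8+ MB figure: by default, memory_usage counts only the pointers of object columns, not the strings they point to; passing deep=True accounts for the string payloads as well. A minimal sketch on a toy frame (not the crime data):

```python
import pandas as pd

# Toy frame with one object (string) column and one int column
df = pd.DataFrame({"name": ["alpha", "beta", "gamma"], "code": [1, 2, 3]})

shallow = df.memory_usage().sum()        # object column counted as pointers only
deep = df.memory_usage(deep=True).sum()  # includes the Python strings themselves

print(f"shallow: {shallow} B, deep: {deep} B")
```

On the crime DataFrame, crime.memory_usage(deep=True) would report the true footprint, which would be noticeably larger than 12.8 MB given the many object columns.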

Data Wrangling and Data Engineering¶

In [10]:
# Changing case to lowercase for column names for convenience of calling

print(f'Before: {crime.columns}')
crime.columns = [x.lower() for x in crime.columns]
print(f'After: {crime.columns}')
Before: Index(['INCIDENT_NUMBER', 'OFFENSE_CODE', 'OFFENSE_CODE_GROUP',
       'OFFENSE_DESCRIPTION', 'DISTRICT', 'REPORTING_AREA', 'SHOOTING',
       'OCCURRED_ON_DATE', 'YEAR', 'MONTH', 'DAY_OF_WEEK', 'HOUR', 'UCR_PART',
       'STREET', 'Lat', 'Long', 'Location'],
      dtype='object')
After: Index(['incident_number', 'offense_code', 'offense_code_group',
       'offense_description', 'district', 'reporting_area', 'shooting',
       'occurred_on_date', 'year', 'month', 'day_of_week', 'hour', 'ucr_part',
       'street', 'lat', 'long', 'location'],
      dtype='object')
In [11]:
# Checking for duplicated rows and removing any found

print(f'Before: {crime.shape}')
      
print(crime.duplicated().sum())
crime.drop_duplicates(inplace=True)

print(f'After: {crime.shape}')
Before: (98888, 17)
161
After: (98727, 17)
In [12]:
# Changing the dtype from object to datetime for the occurred on date column

print(f"Before: {crime['occurred_on_date'].dtype}")

crime['occurred_on_date'] = pd.to_datetime(crime['occurred_on_date'])

print(f"After: {crime['occurred_on_date'].dtype}")
Before: object
After: datetime64[ns]
In [13]:
# Changing the numeric month values to month names to make plots easier to understand

crime['month'] = crime['month'].apply(lambda x: calendar.month_name[x])
In [14]:
# Feature Engineering
# To understand the pattern of seasonal crimes

seasons = {'January': 'Winter', 'February': 'Winter', 'March': 'Winter',
           'April': 'Spring', 'May': 'Spring', 'June': 'Spring',
           'July': 'Summer', 'August': 'Summer', 'September': 'Summer',
           'October': 'Fall', 'November': 'Fall', 'December': 'Fall'}

season_index = crime.columns.get_loc('month')

crime.insert(season_index + 1, 'season', crime['month'].map(seasons))
In [15]:
# Feature Engineering
# For better understandability of plots, breaking down district codes into district names

dist_names = {
    'A1': 'Downtown',
    'A15': 'Charlestown',
    'A7': 'East Boston',
    'B2': 'Roxbury',
    'B3': 'Mattapan',
    'C6': 'South Boston',
    'C11': 'Dorchester',
    'D4': 'South End',
    'D14': 'Brighton',
    'E5': 'West Roxbury',
    'E13': 'Jamaica Plain',
    'E18': 'Hyde Park'
}

crime['district'] = crime['district'].replace(dist_names)
crime.head(1)
Out[15]:
incident_number offense_code offense_code_group offense_description district reporting_area shooting occurred_on_date year month season day_of_week hour ucr_part street lat long location
0 I192074715 2629 Harassment HARASSMENT Roxbury 278 NaN 2018-01-01 2018 January Winter Monday 0 Part Two HARRISON AVE 42.331538 -71.080157 (42.33153805, -71.08015661)
In [16]:
# Outlier check on date

print(min(crime['occurred_on_date']), max(crime['occurred_on_date']), sep = 2 * '\n')

# All dates are well within the expected 2018 range;
2018-01-01 00:00:00

2018-12-31 23:45:00
In [17]:
# Breaking occurred_on_date into date & time, keeping just the date (an hour column already exists) and dropping the occurred_on_date column

dt_index = crime.columns.get_loc('occurred_on_date')
crime.insert(dt_index + 1, 'date', crime['occurred_on_date'].dt.date)

crime.drop(columns = 'occurred_on_date', inplace = True)

crime.head(1)
Out[17]:
incident_number offense_code offense_code_group offense_description district reporting_area shooting date year month season day_of_week hour ucr_part street lat long location
0 I192074715 2629 Harassment HARASSMENT Roxbury 278 NaN 2018-01-01 2018 January Winter Monday 0 Part Two HARRISON AVE 42.331538 -71.080157 (42.33153805, -71.08015661)
In [18]:
# Checking for number of unique values in each column

crime.nunique()
Out[18]:
incident_number        86734
offense_code             184
offense_code_group        61
offense_description      185
district                  12
reporting_area           877
shooting                   1
date                     365
year                       1
month                     12
season                     4
day_of_week                7
hour                      24
ucr_part                   4
street                  3579
lat                    13054
long                   13055
location               13062
dtype: int64
In [19]:
# Defining function to check for columns with missing values

def null_cols(df):
    missing = df.columns[df.isnull().sum() != 0]
    return missing
In [20]:
# Checking for the number of missing values in each column

num_of_null_vals = crime[null_cols(crime)].isnull().sum()
num_of_null_vals
Out[20]:
district      682
shooting    98408
ucr_part       20
street       1609
lat          6747
long         6747
location     6747
dtype: int64
In [21]:
num_of_null_vals / len(crime) *  100
Out[21]:
district     0.690794
shooting    99.676887
ucr_part     0.020258
street       1.629747
lat          6.833997
long         6.833997
location     6.833997
dtype: float64

Reasoning behind column and row droppings¶

Column Dropping:

  • incident_number is a unique identifier and won't do any good for my analysis;

  • As we have offense_code_group, offense_code wouldn't help me out in any way;

  • offense_description is likewise unnecessary, as we won't be using it anywhere in our analysis;

  • year can go too: this is the 2018 Boston crime dataset, so there are no multiple years to distinguish;

  • location - As we have lat and long columns separately, I am dropping this;

  • shooting has almost 100% null values, so we have no option other than dropping it altogether;

Row Dropping:

  • district, street, latitude and longitude - all of these columns give essential info about the location of crimes, without which the entire row would be of no use in most location-based analyses. As they are a negligible proportion of the data, we can drop those rows with null values altogether to get a clean dataset.

    These rows can't be filled in with rough estimates, and trying to fill them with precise values would be a tedious, manual job.

  • The ucr_part column refers to the Uniform Crime Reporting offense types. The UCR classification system divides offenses into four categories: Part One, Part Two, Part Three, and Other. Part One offenses are considered the most severe and include crimes such as larceny/robbery, assault, and breaking & entering. Most of the crimes in Boston in 2018 were classified as UCR Part Three, while 'Other' was the least common classification, with a proportion of just 0.4%.

    ucr_part also has null values in a negligible proportion of the dataset, so we can drop those rows. Again, these could be filled in with precise values, but that would be a tedious job beyond the scope of this analysis;

Note 1: While dropping rows with null values, first drop without the inplace argument and check the shape of the data. If it looks fine, then proceed with inplace.

Note 2: This decision is subjective; it may differ from one analyst's perspective to another and with the context of the analysis
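As an aside on the row-dropping rationale, the drop can also be scoped to specific columns with dropna's subset argument, so only rows missing the location-related fields are removed. A minimal sketch on toy data (column names mirror the dataset; the values are made up):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the location columns; values are made up
df = pd.DataFrame({
    "district": ["Roxbury", None, "Dorchester"],
    "street":   ["HARRISON AVE", "MAIN ST", None],
    "lat":      [42.33, 42.35, np.nan],
})

# Drop only rows missing any of the location fields; no inplace, so df is untouched
cleaned = df.dropna(subset=["district", "street", "lat"])

print(cleaned.shape)   # (1, 3): only the fully populated first row survives
```

This mirrors Note 1's dry-run advice: since dropna returns a new frame by default, you can inspect the shape before committing to the drop.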

In [22]:
print(f'Before: {crime.shape}')

crime.drop(columns = ['incident_number', 'offense_code', 'offense_description', 'year', 'location', 'shooting'], inplace = True)

print(f'After: {crime.shape}')
Before: (98727, 18)
After: (98727, 12)
In [23]:
null_cols(crime)
Out[23]:
Index(['district', 'ucr_part', 'street', 'lat', 'long'], dtype='object')
In [24]:
crime.dropna(axis = 0, ignore_index = True)            # Haven't put inplace = True; Just checking dimensions
Out[24]:
offense_code_group district reporting_area date month season day_of_week hour ucr_part street lat long
0 Harassment Roxbury 278 2018-01-01 January Winter Monday 0 Part Two HARRISON AVE 42.331538 -71.080157
1 Fraud Brighton 794 2018-01-01 January Winter Monday 0 Part Two GLENVILLE AVE 42.349780 -71.134230
2 Other Dorchester 396 2018-01-01 January Winter Monday 0 Part Two MELBOURNE ST 42.291093 -71.065945
3 Property Lost South End 132 2018-01-01 January Winter Monday 0 Part Three COMMONWEALTH AVE 42.353522 -71.072838
4 Larceny Dorchester 360 2018-01-01 January Winter Monday 0 Part One CENTERVALE PARK 42.296323 -71.063569
... ... ... ... ... ... ... ... ... ... ... ... ...
91316 Medical Assistance Hyde Park 555 2018-12-31 December Fall Monday 23 Part Three POPLAR ST 42.277147 -71.125124
91317 Vandalism West Roxbury 564 2018-12-31 December Fall Monday 23 Part Two WASHINGTON ST 42.294217 -71.119853
91318 Medical Assistance Brighton 773 2018-12-31 December Fall Monday 23 Part Three KIRKWOOD RD 42.341269 -71.157506
91319 Investigate Property South End 620 2018-12-31 December Fall Monday 23 Part Three BOYLSTON ST 42.347102 -71.088417
91320 Vandalism Roxbury 318 2018-12-31 December Fall Monday 23 Part Two BROOKLEDGE ST 42.308412 -71.088381

91321 rows × 12 columns

In [25]:
print(f'Before: {crime.shape}')

crime.dropna(axis = 0, inplace = True, ignore_index = True)

print(f'After: {crime.shape}')
Before: (98727, 12)
After: (91321, 12)
In [26]:
null_cols(crime)
Out[26]:
Index([], dtype='object')
In [27]:
crime.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 91321 entries, 0 to 91320
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   offense_code_group  91321 non-null  object 
 1   district            91321 non-null  object 
 2   reporting_area      91321 non-null  object 
 3   date                91321 non-null  object 
 4   month               91321 non-null  object 
 5   season              91321 non-null  object 
 6   day_of_week         91321 non-null  object 
 7   hour                91321 non-null  int64  
 8   ucr_part            91321 non-null  object 
 9   street              91321 non-null  object 
 10  lat                 91321 non-null  float64
 11  long                91321 non-null  float64
dtypes: float64(2), int64(1), object(9)
memory usage: 8.4+ MB
In [28]:
crime.columns
Out[28]:
Index(['offense_code_group', 'district', 'reporting_area', 'date', 'month',
       'season', 'day_of_week', 'hour', 'ucr_part', 'street', 'lat', 'long'],
      dtype='object')
In [29]:
crime.describe(include='object')          # Numeric summary stats hold little value here, so describing just the object columns;
Out[29]:
offense_code_group district reporting_area date month season day_of_week ucr_part street
count 91321 91321 91321 91321 91321 91321 91321 91321 91321
unique 59 12 876 365 12 4 7 4 3241
top Motor Vehicle Accident Response Roxbury 111 2018-06-15 May Spring Friday Part Three WASHINGTON ST
freq 9276 14412 692 342 8340 24101 13819 46698 4492

Let's answer some questions¶

Q1: What are the most common crimes in terms of offense type?¶

In [30]:
offense_prop = (crime['offense_code_group'].value_counts(
                  normalize = True).head(15) * 100).to_frame().reset_index().rename(
                                                                             columns = {'offense_code_group': 'offense'})


g = sns.FacetGrid(data = offense_prop, height = 9, aspect = 1.2)

g.map(sns.barplot, 'proportion', 'offense', order = offense_prop['offense'], palette = 'Reds_r', orient = 'h')

g.set(ylabel = '', xlabel =  'Percentage')

# Removing left yticks and labels and placing it over the right

g.despine(left = True)
plt.tick_params(axis = 'y', which = 'both', left = False)        # both means both major and minor ticks
plt.gca().yaxis.set_label_position('right')
plt.gca().yaxis.tick_right()

plt.show()
In [31]:
offense_prop['proportion'].sum()
Out[31]:
77.21006121264551

Of the 59 offense types in total, the top 15 (roughly a quarter of the types) account for more than three quarters of the crimes that occurred
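A concentration claim like this can be verified directly by taking a cumulative sum over the normalized value counts. A minimal sketch on toy categories (not the crime data):

```python
import pandas as pd

# Toy offense column: "a" dominates, tail categories are rare
offenses = pd.Series(["a"] * 6 + ["b"] * 2 + ["c", "d"])

# Normalized counts come back sorted descending; cumsum gives the running share
share = offenses.value_counts(normalize=True) * 100
cum_share = share.cumsum()

print(cum_share.iloc[1])   # top-2 categories cover 80.0% of rows here
```

Applying the same cumsum to crime['offense_code_group'].value_counts(normalize=True) would show exactly where the running share crosses 75%.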

Q2: Under which UCR part were the most crimes committed?¶

In [32]:
values = crime['ucr_part'].value_counts()
labels = values.index        # taking labels from value_counts keeps them aligned with the counts

fig = go.Figure(data=[go.Pie(labels=labels, values=values, hole=0.6, textinfo='label+percent')])

fig.update_layout(
    title='Crime proportion under each UCR part',
    showlegend=False, height=600
)

fig.show()

Aggregated nationwide, this information feeds the FBI's Uniform Crime Reporting program and can ultimately inform legislation in the US Congress. As described a few cells above, Part One crimes are the most serious ones; although Part Three offenses form the largest share here, the sizeable Part One slice still looks like an area that needs more attention from the legislative body and law enforcement agencies;

Q3: How are crimes distributed across districts?¶

In [33]:
with open('Miscellaneous/map.geojson') as map:
    geojson = json.load(map)

vals = crime['district'].value_counts().reset_index()
merged_data = pd.merge(crime, vals, on='district', how='left')

fig = px.choropleth(merged_data,locations='district', geojson=geojson, featureidkey='properties.name', color='count', color_continuous_scale='Reds')

fig.update_geos(fitbounds='locations', visible=False)
fig.update_layout(title='Crime distribution pattern across districts', coloraxis_colorbar=dict(title='Count'))
fig.update_traces(hovertemplate='<b>District</b>: %{location}<br><b>Number of Crimes</b>: %{z}<extra></extra>')

fig.show()

Q4: How are crimes distributed across locations within the county?¶

In [34]:
fig = px.scatter_mapbox(
    merged_data.drop(['date', 'offense_code_group', 'reporting_area', 'street'], axis=1),
    lat='lat',
    lon='long',
    zoom=10,
    mapbox_style='carto-positron',
    hover_name=merged_data['district'],
    hover_data={
        'date': merged_data['date'],  # Include 'date' in hover_data
        'offense_code_group': merged_data['offense_code_group'],  # Include 'offense_code_group' in hover_data
        'reporting_area': merged_data['reporting_area'],
        'street': merged_data['street']
    }
)

fig.update_geos(fitbounds='locations', visible=False)

fig.update_layout(
    title='Crime distribution pattern across locations'
)

fig.update_traces(
    hovertemplate='<b>District</b>: %{hovertext}<br>'
                  '<b>Date</b>: %{customdata[0]}<br>'
                  '<b>Offense Code Group</b>: %{customdata[1]}<br>'
                  '<b>Reporting Area</b>: %{customdata[2]}<br>'
                  '<b>Street</b>: %{customdata[3]}<extra></extra>'
)

fig.show()

Q5: In which months were the most and least crimes committed?¶

In [35]:
crime_by_month = crime['month'].value_counts().sort_index().reset_index()
crime_by_month = crime_by_month.sort_values('count', ascending=True)

average_crime = crime_by_month['count'].mean()

fig = px.bar(crime_by_month, y='month', x='count', orientation='h', color='count', 
             color_continuous_scale='viridis_r')

fig.add_shape(type="line",
              x0=average_crime, y0=-0.5,
              x1=average_crime, y1=len(crime_by_month) - 0.5,
              line=dict(color="red", width=2, dash="dash"))

fig.add_annotation(x=average_crime, y=len(crime_by_month) - 0.5,
                   text=f'Average: {average_crime:.0f}',
                   showarrow=True, arrowhead=1, ax=-50, ay=-40)

fig.update_layout(
    title='Crime Count by Month',
    xaxis_title='Number of Crimes',
    yaxis_title='Month',
    xaxis=dict(showgrid=False),
    yaxis=dict(showgrid=False),
    coloraxis_colorbar=dict(title='Count')
)

fig.show()
As the above plot hasn't given much information, I am going for the plot below

Q6: Are there any crime specific trends across months?¶

In [36]:
grouped_data = merged_data.groupby(['offense_code_group', 'month'])['count'].size().reset_index()

grouped_data.rename(
    {'offense_code_group': 'Offense Code Group', 'month': 'Month', 'count': 'Count'},
    axis=1, inplace=True
)

fig = px.bar(
    grouped_data, x='Month', y='Count',
    animation_frame='Offense Code Group',
    range_y=[0, grouped_data['Count'].max() + 100]
)

fig.update_layout(title='Monthly Crime Count by Offense Code Group',
                  xaxis_title='Month',
                  yaxis_title='Crime Count')

fig.layout.updatemenus[0].buttons[0].args[1]['frame']['duration'] = 1500

fig.show()

Q7: Does the number of crimes have any relation to the season?¶

In [37]:
crime_by_season = crime['season'].value_counts().reset_index()
crime_by_season.columns = ['Season', 'Count']

fig = px.pie(crime_by_season, values='Count', names='Season', hole=0.6)

fig.update_layout(
    title='Crime Proportion by Season'
)

fig.show()
The above plot shows that the overall crime count is fairly evenly distributed across seasons, so I'm breaking it down further below
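"Evenly distributed" can also be quantified with a chi-square goodness-of-fit test against a uniform expectation, using the scipy.stats module already imported above as stat. A minimal sketch with illustrative season counts (the numbers below are made up, roughly uniform; with totals this large even small relative differences can come out statistically significant, so the p-value should be read alongside the relative spread):

```python
import scipy.stats as stat

# Illustrative season counts (made-up numbers, roughly uniform)
season_counts = [24101, 23050, 22400, 21770]

# H0: crimes are spread uniformly across the four seasons
chi2, p = stat.chisquare(season_counts)

print(f"chi2 = {chi2:.1f}, p = {p:.3g}")
```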

Q8: Are there any crime specific trends across seasons?¶

In [38]:
grouped_data = merged_data.groupby(['offense_code_group', 'season'])['count'].size().reset_index()

grouped_data.rename(
                    {'offense_code_group': 'Offense Code Group', 'season': 'Season', 'count': 'Count'},
                    axis = 1, inplace = True
                    )

fig = px.bar(
             grouped_data, x='Season', y='Count',
             animation_frame='Offense Code Group',
             range_y=[0, grouped_data['Count'].max() + 100]
            )

fig.update_layout(title='Seasonal Crime Count by Offense Code Group',
                  xaxis_title='Season',
                  yaxis_title='Crime Count')

fig.layout.updatemenus[0].buttons[0].args[1]['frame']['duration'] = 1500

fig.show()

Q9: In which districts were the most and least crimes committed?¶

In [39]:
crime_by_district = merged_data.groupby('district')['count'].mean().reset_index()
crime_by_district = crime_by_district.sort_values('count', ascending=True)

overall_average_crime = crime_by_district['count'].mean()

fig = px.bar(crime_by_district, y='district', x='count', color='count',
             color_continuous_scale='viridis_r', orientation='h')

fig.add_shape(type="line",
              x0=overall_average_crime, y0=-0.5,
              x1=overall_average_crime, y1=len(crime_by_district) - 0.5,
              line=dict(color="red", width=2, dash="dash"))

fig.add_annotation(x=overall_average_crime, y=len(crime_by_district) - 0.5,
                   text=f'Average: {overall_average_crime:.0f}',
                   showarrow=True, arrowhead=1, ax=-50, ay=-40)

fig.update_layout(
    title='Crime Count by District',
    xaxis_title='Number of Crimes',
    yaxis_title='District',
    xaxis=dict(showgrid=False),
    yaxis=dict(showgrid=False),
    coloraxis_colorbar=dict(title='Count')
)

fig.show()

Q10: Are there more crimes committed on specific days?¶

In [40]:
crime_by_day = crime['day_of_week'].value_counts().reset_index()    # value_counts already sorts by count, descending
crime_by_day.columns = ['Day of Week', 'Count']

fig = px.bar(crime_by_day, x='Day of Week', y='Count', color='Count', 
             color_continuous_scale='cividis_r')

fig.update_layout(
    title='Crime Count by Day of Week',
    xaxis_title='Day of Week',
    yaxis_title='Number of Crimes'
)

fig.show()

Q11: Are there more crimes committed during specific hours?¶

In [41]:
crime_by_hour = crime['hour'].value_counts().sort_index().reset_index()
crime_by_hour.columns = ['Hour', 'Count']

fig = px.bar(crime_by_hour, x='Hour', y='Count', animation_frame='Hour',
             range_x=[0, 23], range_y=[0, crime_by_hour['Count'].max()],
             labels={'Hour': 'Hour of the Day', 'Count': 'Number of Crimes'})

fig.update_layout(
    title='Crime Count by Hour of the Day',
    xaxis_title='Hour of the Day',
    yaxis_title='Number of Crimes',
    yaxis=dict(title_standoff=0)
)

fig.layout.updatemenus[0].buttons[0].args[1]['frame']['duration'] = 1500

fig.show()

Q12: On which days and during which hours are most crimes committed?¶

In [42]:
fig, ax = plt.subplots(figsize=(10, 6))

week_and_hour = crime.groupby(['hour', 'day_of_week']).count()['offense_code_group'].unstack()

# unstack emits columns in alphabetical order, so reorder by label instead of renaming positionally
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
week_and_hour = week_and_hour[day_order]

heatmap = sns.heatmap(week_and_hour, cmap=sns.cubehelix_palette(as_cmap=True), ax=ax)
heatmap.set_yticklabels(heatmap.get_yticklabels(), rotation=0)

plt.xlabel('')
plt.ylabel('Hour')

plt.show()
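A caveat worth remembering when pivoting day-of-week counts: groupby and unstack emit their keys in sorted (alphabetical) order, not calendar order, so it is safer to reorder the result by label than to rename positions. A minimal sketch of the pattern on toy data (the values are made up):

```python
import pandas as pd

# Toy incidents with a day-of-week label
df = pd.DataFrame({
    "day": ["Wednesday", "Monday", "Friday", "Monday"],
    "n":   [1, 1, 1, 1],
})

counts = df.groupby("day")["n"].count()
print(list(counts.index))    # groupby sorts keys: ['Friday', 'Monday', 'Wednesday']

# Reorder by label rather than renaming positionally
day_order = ["Monday", "Wednesday", "Friday"]
counts = counts.reindex(day_order)
print(list(counts.index))    # ['Monday', 'Wednesday', 'Friday']
```

reindex matches by label, so each count stays attached to its true day no matter what order the groupby produced.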

The below table is a treasure trove of key insights¶

In [43]:
offense_counts = crime['offense_code_group'].value_counts().to_frame()

off_unique = crime.groupby('offense_code_group').nunique()
off_unique.insert(0, 'count', offense_counts.iloc[:, 0])
off_unique = off_unique.sort_values('count', ascending=False)

off_unique.columns = [column.replace('_', ' ').title() for column in off_unique.columns]
off_unique.rename_axis('Offense Code Group', axis='index', inplace=True)

off_unique.head(len(off_unique))
Out[43]:
Count District Reporting Area Date Month Season Day Of Week Hour Ucr Part Street Lat Long
Offense Code Group
Motor Vehicle Accident Response 9276 12 830 365 12 4 7 24 1 1665 5385 5382
Larceny 7831 12 747 365 12 4 7 24 1 1179 2850 2850
Medical Assistance 7822 12 817 365 12 4 7 24 1 1697 3954 3953
Other 5254 12 760 365 12 4 7 24 3 1302 2890 2889
Investigate Person 5238 12 776 365 12 4 7 24 1 1433 3099 3099
Simple Assault 4953 12 719 365 12 4 7 24 1 1121 2610 2610
Verbal Disputes 4362 12 649 365 12 4 7 24 1 1177 2261 2261
Drug Violation 4111 12 563 362 12 4 7 24 1 709 1514 1514
Vandalism 4080 12 737 365 12 4 7 24 1 1328 2876 2875
Investigate Property 3575 12 719 365 12 4 7 24 1 1143 2282 2283
Towed 3488 12 657 363 12 4 7 24 1 1085 2305 2305
Property Lost 3341 12 667 365 12 4 7 24 1 876 1939 1940
Larceny From Motor Vehicle 2908 12 682 364 12 4 7 24 1 1113 2215 2215
Aggravated Assault 2254 12 556 363 12 4 7 24 1 740 1490 1490
Fraud 2016 12 655 358 12 4 7 24 1 894 1549 1550
Warrant Arrests 1990 12 498 363 12 4 7 24 1 597 1186 1187
Missing Person Located 1695 12 450 362 12 4 7 24 1 609 914 914
Violations 1359 12 479 348 12 4 7 24 1 488 1035 1035
Residential Burglary 1297 12 471 348 12 4 7 24 1 684 1016 1016
Harassment 1287 12 541 347 12 4 7 24 1 648 1004 1004
Auto Theft 1240 12 501 345 12 4 7 24 1 609 1038 1038
Property Found 1187 12 448 338 12 4 7 24 1 437 763 762
Robbery 1076 12 403 345 12 4 7 24 1 407 831 831
Police Service Incidents 1036 12 415 333 12 4 7 24 1 445 718 718
Missing Person Reported 883 12 312 326 12 4 7 24 2 399 540 540
Confidence Games 842 12 420 315 12 4 7 24 1 421 686 686
Disorderly Conduct 574 12 262 278 12 4 7 24 1 236 407 407
Fire Related Reports 546 12 354 282 12 4 7 24 2 348 500 500
License Violation 514 12 194 213 12 4 7 20 1 143 315 315
Restraining Order Violations 469 12 233 261 12 4 7 24 1 268 338 338
Firearm Violations 460 12 251 246 12 4 7 24 1 243 361 361
Counterfeiting 404 12 269 244 12 4 7 24 1 207 335 336
Recovered Stolen Property 385 12 248 232 12 4 7 24 1 242 347 347
Landlord/Tenant Disputes 348 12 180 218 12 4 7 24 1 212 250 250
Liquor Violation 320 11 97 172 12 4 7 21 1 83 155 155
Auto Theft Recovery 320 12 218 210 12 4 7 23 1 226 288 288
Commercial Burglary 313 12 190 198 12 4 7 24 1 133 247 247
Property Related Damage 263 12 216 161 12 4 7 24 1 198 253 253
Ballistics 259 12 162 185 12 4 7 24 1 199 239 239
Search Warrants 243 12 142 140 12 4 7 24 1 145 179 179
Assembly or Gathering Violations 184 12 97 129 12 4 7 24 1 104 140 140
License Plate Related Incidents 182 12 148 139 12 4 7 21 2 150 167 167
Firearm Discovery 177 12 128 139 12 4 7 23 1 133 148 148
Offenses Against Child / Family 124 12 94 96 12 4 7 22 1 95 112 112
Other Burglary 123 12 100 103 12 4 7 24 1 89 111 111
Operating Under the Influence 113 12 98 97 12 4 7 19 1 91 112 112
Evading Fare 108 12 93 94 12 4 7 24 1 77 102 102
Embezzlement 93 11 73 82 12 4 7 17 1 58 81 81
Prisoner Related Incidents 84 10 66 75 12 4 7 22 2 58 71 71
Service 71 12 68 65 12 4 7 18 1 62 70 70
Homicide 50 10 45 47 12 4 7 19 1 47 50 50
Criminal Harassment 41 10 37 37 11 4 7 18 1 34 37 37
Bomb Hoax 33 9 31 18 12 4 7 13 1 28 32 32
Harbor Related Incidents 33 6 11 31 11 4 7 14 1 16 17 17
Prostitution 30 4 15 19 9 4 6 11 1 16 20 20
Phone Call Complaints 19 7 18 19 11 4 7 14 1 19 19 19
Arson 17 6 16 17 9 4 7 16 1 15 16 16
Aircraft 14 1 1 13 7 4 7 10 1 2 2 2
Explosives 6 5 6 6 3 2 5 5 2 6 6 6